Pinpoint Clustering of Web Pages and Mining Implicit Crossover Concepts

Authors

  • Makoto HARAGUCHI
  • Yoshiaki OKUBO
Abstract

In this chapter, we present our Top-N methods for extracting clusters of documents, which originated in (Haraguchi, 2002). We first discuss a method for pinpoint clustering of Web pages by pseudo-clique search (Haraguchi & Okubo, 2006; Okubo et al., 2005) and then a method for finding implicit page groups (clusters) represented as formal concepts (Li et al., 2008). Huge collections of documents, including pages on the Web, are regarded as an information source for knowledge. One of the core tasks of Information Retrieval (IR) is to find useful and important documents in such a collection effectively. For this purpose, many retrieval engines compute ranks of documents and show them in rank order (Page et al., 1999; Salton & McGill, 1983). Highly ranked documents are easily checked by users, while lower-ranked documents are rarely examined. Every retrieval system based on document ranking has its own ranking scheme, so even potentially interesting documents are sometimes ranked low and are therefore effectively hidden from users. In this sense, we might be missing many useful documents. If we can make such hidden but significant documents visible, our chances of obtaining valuable information and knowledge improve. The standard approach to this problem is clustering (Gan et al., 2007), by which we classify documents into several clusters of similar documents. We pick a few clusters that seem relevant and then examine them in detail to look for interesting documents. However, if the number of clusters is small, the clusters tend to be large, include even dissimilar documents, and are hard to examine. Conversely, if we have many clusters, it is hard to check every cluster, although each one is smaller and contains only similar documents. Thus, controlling the number of clusters adequately is not an easy task.
This has motivated us to investigate a new clustering method, Pinpoint Clustering, by which we can efficiently extract only good clusters. In (Haraguchi & Okubo, 2006; Okubo et al., 2005), we developed a strategy for finding only the Top-N clusters of similar documents with respect to evaluation values that reflect the ranks of the documents in them. In this framework, document similarity is evaluated with the help of Singular Value Decomposition (SVD) (Strang, 2003). We first extract semantic correlations among terms by applying SVD to the term-document matrix generated from a corpus on a specific topic. Then, given a set of ranked Web pages to be clustered, we evaluate potential similarities among...
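
The SVD step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy corpus, the term list, the rank k = 2, the use of cosine similarity, and the helper names `latent_term_basis` and `cosine_matrix` are all assumptions made for illustration. The sketch applies SVD to a small term-document matrix, keeps the top-k left singular vectors as a latent term basis, and compares pages after projecting them onto that basis.

```python
import numpy as np

# Hypothetical vocabulary; rows of the matrices below follow this order.
terms = ["cluster", "clique", "graph", "recipe", "oven"]

# Toy term-document matrix (rows = terms, columns = corpus documents).
# Columns: cluster+clique, clique+graph, cluster+graph, recipe+oven, oven.
corpus = np.array([
    [1, 0, 1, 0, 0],  # cluster
    [1, 1, 0, 0, 0],  # clique
    [0, 1, 1, 0, 0],  # graph
    [0, 0, 0, 1, 0],  # recipe
    [0, 0, 0, 1, 1],  # oven
], dtype=float)


def latent_term_basis(term_doc, k):
    """Return the top-k left singular vectors (terms x k) of term_doc."""
    u, _singular_values, _vt = np.linalg.svd(term_doc, full_matrices=False)
    return u[:, :k]


def cosine_matrix(rows):
    """Pairwise cosine similarity between the row vectors of `rows`."""
    norms = np.linalg.norm(rows, axis=1, keepdims=True)
    unit = rows / np.maximum(norms, 1e-12)  # guard against zero vectors
    return unit @ unit.T


# Pages to be compared, written as term vectors (rows = pages).
pages = np.array([
    [1, 0, 0, 0, 0],  # page 0 mentions only "cluster"
    [0, 0, 1, 0, 0],  # page 1 mentions only "graph"
    [0, 0, 0, 0, 1],  # page 2 mentions only "oven"
], dtype=float)

basis = latent_term_basis(corpus, k=2)      # assumed rank; tune per corpus
raw_sim = cosine_matrix(pages)              # similarity on raw term vectors
latent_sim = cosine_matrix(pages @ basis)   # similarity in the latent space
```

In this toy setting, pages 0 and 1 share no term, so their raw similarity is zero, yet they become close in the latent space because "cluster" and "graph" co-occur with the same terms in the corpus. Page 2 stays dissimilar to both.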


Similar resources

A Technique for Improving Web Mining using Enhanced Genetic Algorithm

The World Wide Web is growing at a very fast pace and makes a lot of information available to the public. Search engines use conventional methods to retrieve information on the Web; however, the search results of these engines can still be refined, and their accuracy is not high enough. One method for web mining is evolutionary algorithms, which search according to the user interests...


Use of Semantic Similarity and Web Usage Mining to Alleviate the Drawbacks of User-Based Collaborative Filtering Recommender Systems

One of the most famous methods for recommendation is user-based Collaborative Filtering (CF). This system compares the active user's item ratings with the historical rating records of other users to find similar users, and recommends items that seem interesting to these similar users and have not been rated by the active user. As a way of computing recommendations, the ultimate goal of the user-ba...


Prioritize the ordering of URL queue in Focused crawler

The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, it is not a simple task to download domain-specific web pages. An unfocused approach often shows undesired results. Therefore, several new ideas have been proposed; among them, a key technique is focused crawling, which is able to crawl particular topical...


Expert Discovery: A web mining approach

Expert discovery is a quest to answer a question: “Who is the best expert of a specific subject in a particular domain within peculiar array of parameters?” An expert with domain knowledge in any field is crucial for consulting in industry, academia, and the scientific community. The aim of this study is to address the issues of the expert-finding task in real-world communities. Collabor...


Finding Community Base on Web Graph Clustering

Search Pointers organize the main part of the application on the Internet. However, because of information management hardware, the high volume of data, and word similarities across different fields, most answers to users' questions aren't correct. Web graph clustering and the placement of clusters in the corresponding answers therefore help users achieve their intended results. Community (web communit...



Publication date: 2012